InfoVis 2003 Contest - InfoZoom Entry
Michael Spenke, Christian Beilken
{Michael.Spenke,Christian.Beilken}@fit.fraunhofer.de
FIT - Fraunhofer Institute for Applied
Information Technology
See Infovis 2003 Contest rules and task at http://www.cs.umd.edu/hcil/iv03contest/
Ratings used below: (Strength,Possible,Difficult,Not Available)
Pairwise comparisons of trees: Topological
changes
Did anything change, in general, or in a
subtree?
- Rating:
- Strength
- Process:
- A simple side-by-side comparison of two or more trees gives a
first impression of the differences.
- Marking a cell in one tree also marks it in the other tree.
- This makes it easy to compare the size of corresponding cells.
(First image)
- In order to compare two or more trees precisely we first need a
mapping between the nodes of the trees,
- which defines when two nodes in different trees are regarded as identical.
- In the animal classification trees each animal has a
Latin name which is unique within a tree.
- Therefore, we consider two animals in different trees as
identical, if and only if they have the same Latin name.
- Using the derived attribute Count(Tree) per Latin Name
we can exactly determine which animals are found in both trees
and which are only contained in one of the trees.
- This is explained in detail further
below in the answer to the application specific question
- "To what extent are the differences in the classifications due
to differences in how animals are thought to be related?".
The attribute Latin Path contains the full path name of each
animal.
- Therefore, the derived attribute Count(Latin Path) per Latin
Name can be used to find all animals that are differently
classified in the
two trees.
- This is also explained in detail further below in the answer to
the
question mentioned above.
- In the file system logs there is no unique identification
of a file besides the full path name.
- Consequently, it is impossible to find files that were renamed or
moved to another directory.
- We can only see that some files are missing in a later snapshot
and new files have appeared.
- These files can be exactly determined using the derived
attributes List(week) per file path and
- Count(week) per file path. We simply zoom on all files
which do not appear in all 5 snapshots.
- (Second, third, and fourth image)
Another approach to find the differences between the snapshots is based
on the creation and modification times of the files.
- This is explained in detail further below
in the answer to the application specific question
- "Were there a lot of pages created recently? If so, in which
part of the file system?"
- Image:
- The two animals trees side by side
- Most files are found in all 5 snapshots.
- Some files are found in only 4 or fewer snapshots.
- Overview of the files found in only 4 or fewer snapshots.
- Answer:
- 4701 files do not appear in all 5 snapshots.
- Most of them are found in the toplevel directories users,
usersrschulz, and class, especially in class/spring2003.
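For readers without InfoZoom, the attributes List(week) per file path and Count(week) per file path can be imitated in plain Python. This is only a sketch under assumptions: the records are invented for illustration, and the function names weeks_per_path and incomplete_paths are ours, not part of InfoZoom.

```python
from collections import defaultdict

# Hypothetical (file path, week) records; the real table holds one record
# per file and snapshot.
records = [
    ("/index.html", "A"), ("/index.html", "B"), ("/index.html", "C"),
    ("/index.html", "D"), ("/index.html", "E"),
    ("/class/spring2003/new.html", "D"), ("/class/spring2003/new.html", "E"),
    ("/users/old.txt", "A"),
]

ALL_WEEKS = ["A", "B", "C", "D", "E"]


def weeks_per_path(rows):
    """Analogue of List(week) per file path: sorted weeks in which a path occurs."""
    weeks = defaultdict(set)
    for path, week in rows:
        weeks[path].add(week)
    return {path: sorted(found) for path, found in weeks.items()}


def incomplete_paths(rows, n_snapshots=len(ALL_WEEKS)):
    """Paths whose Count(week) is smaller than the number of snapshots."""
    return {
        path: found
        for path, found in weeks_per_path(rows).items()
        if len(found) < n_snapshots
    }


changed = incomplete_paths(records)
for path, found in sorted(changed.items()):
    print(path, found)
```

As in the InfoZoom analysis, a file that was renamed or moved shows up here as one missing path and one new path, because the path is the only identifier.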
What nodes were added, deleted?
- Rating:
- Strength
- Process:
- We can exactly determine the nodes that are contained in one tree
but not in the other.
- For details see previous question.
- Image:
- See previous question
- Answer:
- See previous question
Did any node or subtrees "move" in the
tree? Can
you characterize those movements?
- Rating:
- Difficult
- Process:
- Using the derived attribute Count(Latin Path) per Latin Name,
we can exactly determine the animals that are contained in
both trees but with a different classification.
- For details see first question.
It is not, however, possible to automatically decide if the differences
found are the result of the move of a complete subtree.
- Some manual browsing is necessary here.
- Image:
- See first question
- Answer:
- See first question
Pairwise comparisons of trees: Attribute
value changes
Global impression: did things change a lot
or not?
- Rating:
- Strength
- Process:
- We define the derived attributes Average(hitCount) per week
and Average(size in KB) per week and sort by week.
- Image:
- Answer:
- The average file size increases from week to week
- The average hit count increases after week B.
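The two derived attributes amount to a grouped average. A minimal Python sketch, assuming each record is a (week, hitCount, size in KB) tuple; the values are made up and the helper name average_per_week is hypothetical:

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows: (week, hitCount, size in KB).
rows = [
    ("A", 2, 10.0), ("A", 4, 30.0),
    ("B", 1, 20.0), ("B", 3, 40.0),
]


def average_per_week(rows, column):
    """Average of one column (by index) grouped by week, sorted by week."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[column])
    return {week: mean(values) for week, values in sorted(groups.items())}


avg_hits = average_per_week(rows, 1)
avg_size = average_per_week(rows, 2)
print(avg_hits)
print(avg_size)
```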
What nodes or subtrees changed the most?
- Rating:
- Strength
- Process:
- To answer this question, we defined a derived attribute Minimum(hitCount)
per file path / Maximum(hitCount) per file path.
After a zoom on the highest values of this attribute, we see the files
with the highest increase rates of hitCount within the five
snapshots.
- Image:
- Answer:
- See image.
Did the value of attribute XYZ for this
node
increase or decrease? In absolute terms, or relatively to other
siblings
or other nodes.
- Rating:
- Strength
- Process:
- We zoomed on the file /index.html in all 5 weeks.
- Then we sorted the table by week.
- Image:
- Answer:
- The values for attribute hitCount show an increase from
week B to E.
General visualization of trees: Topology
Overall characteristics: How large is the
tree? How many levels deep? What is the deepest branch? Does the depth
vary
between subtrees or not?
- Rating:
- Possible
- Process:
- The number of currently displayed objects is always shown in the
upper left corner of the table.
- In our representation of the tree as a table, each animal has 19
attributes for the levels from Kingdom down to Species.
For each animal some of these attributes do not have a value, i.e. the
value is the empty string.
The number of non-empty values of an animal corresponds to its branch
length.
We defined a derived attribute Branch Length with the following
somewhat long but straightforward formula:
if([Kingdom] = "",0,1) + if([Phylum] = "",0,1) + if([Subphylum] = "",0,1) + if([Superclass] = "",0,1) +
if([Class] = "",0,1) + if([Subclass] = "",0,1) + if([Infraclass] = "",0,1) + if([Superorder] = "",0,1) +
if([Order] = "",0,1) + if([Suborder] = "",0,1) + if([Infraorder] = "",0,1) + if([Superfamily] = "",0,1) +
if([Family] = "",0,1) + if([Subfamily] = "",0,1) + if([Tribe] = "",0,1) + if([Subtribe] = "",0,1) +
if([Genus] = "",0,1) + if([Subgenus] = "",0,1) + if([Species] = "",0,1)
- Image:
Latin classification
Common classification
- Answer:
- The attribute Subtribe almost never has a value. This
means that this level is missing for most of the animals.
- The deepest branch has 14 named levels.
We can exactly zoom on the animals with this branch depth.
For example: /Arthropoda/Hexapoda/Insecta/Pterygota/Neoptera/Hymenoptera/Apocrita/Scolioidea/Formicidae/Formicinae/Camponotini/Polyrhachis/Polyrhachis
tibialis/Polyrhachis tibialis robustior
- In the classification with common names most of the levels do not
have a name.
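The Branch Length formula simply counts the non-empty rank attributes of an animal. The same idea as a short Python sketch, assuming each animal is a dictionary keyed by rank name; the function name branch_length and the partial example record are ours:

```python
# The 19 rank attributes used in the Branch Length formula.
RANKS = [
    "Kingdom", "Phylum", "Subphylum", "Superclass", "Class", "Subclass",
    "Infraclass", "Superorder", "Order", "Suborder", "Infraorder",
    "Superfamily", "Family", "Subfamily", "Tribe", "Subtribe",
    "Genus", "Subgenus", "Species",
]


def branch_length(animal):
    """Number of ranks with a non-empty value (missing keys count as empty)."""
    return sum(1 for rank in RANKS if animal.get(rank, "") != "")


# Illustrative, incomplete record; only the upper ranks are filled in.
example = {
    "Phylum": "Arthropoda", "Subphylum": "Hexapoda", "Class": "Insecta",
    "Subclass": "Pterygota", "Superorder": "Neoptera", "Order": "Diptera",
    "Subtribe": "",
}
print(branch_length(example))
```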
Path: What is the path of this node?
- Rating:
- Strength
- Process:
- The path of an animal is given by its attribute values and also
by
its attribute Latin Path.
- Image:
- Answer:
- Not applicable
Local relatives: What are the children,
siblings, or cousins of this node?
- Rating:
- Strength
- Process:
- This can be directly seen.
- Image:
- Answer:
- Brachycera and Nematocera are the children of Diptera.
- Coleoptera, Hymenoptera, and Trichoptera are
siblings of Diptera.
- Heteroptera is a cousin of Diptera.
- Some siblings and cousins are too small to read. We can
look
into the value menu or zoom on them to see their names.
Filtering by level: Show only the first
level, or show only 3 levels down, or remove all the leaves
- Rating:
- Strength
- Process:
- Attributes can be hidden temporarily or permanently.
- The projection on a subset of the attributes can be constructed
as a new table
- Image:
- Levels below Class are hidden
- Projection on the top 5 levels
- Answer:
- In the first image some levels are hidden, but the number of
animals remains unchanged.
- The second image shows the projection on the top levels down to Class.
There are only 112 columns because a column now corresponds to a Class and
no longer to an animal.
Cell sizes differ from the first image because the width is
proportional to the number of classes here.
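The difference between hiding attributes and projecting on them can be sketched in Python. Hiding keeps one record per animal, whereas a projection keeps one record per distinct value combination. The helper name project and the three sample animals are assumptions for illustration only:

```python
def project(records, attributes):
    """Map each distinct combination of the given attributes
    to the number of original records it covers."""
    columns = {}
    for record in records:
        key = tuple(record.get(name, "") for name in attributes)
        columns[key] = columns.get(key, 0) + 1
    return columns


# Invented sample: three animals in two classes.
animals = [
    {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Diptera"},
    {"Phylum": "Arthropoda", "Class": "Insecta", "Order": "Hymenoptera"},
    {"Phylum": "Chordata", "Class": "Aves", "Order": ""},
]

projected = project(animals, ["Phylum", "Class"])
print(len(animals), "animals,", len(projected), "columns after projection")
```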
Topology questions that involve counting
nodes can be seen as attribute-dependent questions: e.g. Which branch
contains the largest number of nodes? or Which branch has the largest
fan-out?
- Rating:
- Strength
- Process:
- Just look for the widest cells in each row. If you are unsure:
Open the value menu and sort by frequency. Shown in the image for Order.
- Image:
- Answer:
- The subtrees with the largest fanouts are the Phylum Arthropoda,
the Subphylum Hexapoda, the Class Insecta, the
Subclass Pterygota, and the Superorder Neoptera.
Within the orders Diptera wins with 20158 entries, followed by Hymenoptera
with 15117 entries.
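Sorting the value menu by frequency corresponds to a simple frequency count. A sketch in Python using collections.Counter; the short list of Order values is invented and much smaller than the real data:

```python
from collections import Counter

# Invented Order values, one per animal.
orders = [
    "Diptera", "Diptera", "Hymenoptera",
    "Diptera", "Coleoptera", "Hymenoptera",
]

# most_common() returns (value, frequency) pairs, most frequent first.
by_frequency = Counter(orders).most_common()
for value, frequency in by_frequency:
    print(value, frequency)
```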
General visualization of trees: Attribute
based
Find nodes with high values of a numerical
attribute X? (relative query)
- Rating:
- Strength
- Process:
- Switch to Overview Mode.
- Select a range at the right end of the attribute's row and zoom
in.
- Repeat until the rightmost cell is large enough to display
the value.
- Alternatively, open the value list dialog and sort it in descending order.
The highest value is then displayed on top.
- Image:
- Answer:
- Not applicable
Find nodes with given value of a numerical
attribute X? (absolute query)
- Rating:
- Strength
- Process:
- Switch to Overview Mode.
- If the value can already be seen, just double-click it.
- Otherwise select a rough range around the value and zoom in,
possibly in several steps.
- Alternatively, the value can be directly selected in the value
list dialog, which might be very long, however.
- Image:
- Answer:
- Not Applicable
Find nodes with value Y of categorical
attribute X - What value of a categorical attribute occurs more often?
e.g. Are there more farm animals or pets?
- Rating:
- Strength
- Process:
- In the Overview Mode the value distribution of all
attributes is shown.
- The width of each cell is proportional to the number of files
with this
value.
- For cells that are too small we can look up the size in the value
list dialog.
- The value list can be sorted by frequency, so that the largest
values/cells are on top.
- Image:
- Answer:
- html, gif, and jpg are the most frequent file
formats.
Find nodes with certain values of two or
more attributes (What video file is used the most?)
- Rating:
- Strength
- Process:
- Open the value list dialog of attribute extension.
- Select the video formats like avi, mpg, mov
and zoom in.
- Zoom on the highest values of hitCount.
- Image:
- Answer:
- /projects/hcil/kiddesign/icdl/icdl.mpg is used the
most.
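The two zoom steps correspond to a filter on extension followed by a maximum over hitCount. A minimal Python sketch, assuming each file is a dictionary with the attributes named in the text; the paths and hit counts are invented and most_used is a hypothetical helper:

```python
VIDEO_EXTENSIONS = {"avi", "mpg", "mov"}

# Invented sample files.
files = [
    {"file path": "/demo/a.mpg", "extension": "mpg", "hitCount": 12},
    {"file path": "/demo/b.avi", "extension": "avi", "hitCount": 7},
    {"file path": "/demo/c.html", "extension": "html", "hitCount": 99},
]


def most_used(files, extensions):
    """File with the highest hitCount among the given extensions, or None."""
    candidates = [f for f in files if f["extension"] in extensions]
    return max(candidates, key=lambda f: f["hitCount"], default=None)


winner = most_used(files, VIDEO_EXTENSIONS)
print(winner["file path"] if winner else "no video files")
```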
Number of nodes in a tree or subtree? (How
many animals? How many mammals?)
- Rating:
- Strength
- Process:
- There are several possibilities:
- Mark a cell. The tool tip will show the number of animals
represented by this cell.
- Open the value list dialog box. The frequency of each value
is displayed there.
- Zoom on the subtree. The number of displayed objects is shown
in the upper left corner of the table.
- Image:
- Answer:
- The subtree of Insecta contains 64423 animals.
Comparison of branches of the tree
(Subtrees
with most nodes; are there more mammals or fish?)
- Rating:
- Strength
- Process:
- Image:
- Answer:
- There are more bony fishes than mammals.
Largest fanout (What is the largest group
of animals with same lineage?)
- Rating:
- Strength
- Process:
- Just look for the widest cells in each row. If you are unsure:
Open the value menu and sort by frequency. Shown in the image for Order.
- Image:
- Answer:
- The subtrees with the largest fanouts are the Phylum Arthropoda,
the Subphylum Hexapoda, the Class Insecta, the
Subclass Pterygota, and the Superorder Neoptera.
Within the orders Diptera wins with 20158 entries, followed by Hymenoptera
with 15117 entries.
General visualization of trees: Known items
Which nodes have a particular string in
their label? (Find "giraffe" in a tree of animals)
- Rating:
- Strength
- Process:
- We perform a full-text search in all attributes for "giraffe".
- This zooms on all animals that contain "giraffe" in at
least one of their attributes.
- Image:
- Answer:
- There are only two animals with giraffe in their common
names.
Locate a node knowing its path
- Rating:
- Strength
- Process:
- Just click onto the cells containing the next label. If the cell
is too small select a range containing the label or use the value menu.
- Image:
- Not Applicable
- Answer:
- Not Applicable
Go back to a node you have visited before
- Rating:
- Strength
- Process:
- There are several techniques:
- Use the history mechanism by pressing the back button
until you find the node.
- For bookmarking, insert a new attribute while you are zoomed
onto the node.
The visible records get a number within this new attribute, while other
nodes get an empty entry. This makes it easy to zoom onto this node again.
- Insert and name a query that saves the steps that lead to
this node.
This works like a macro, which can replay the steps to this node at any
time.
- Image:
- Not Applicable
- Answer:
- Not Applicable
General visualization of trees: Labeling
Review all the labels in a subtree
- Rating:
- Strength
- Process:
- First we zoom on the subtree, e.g. Insecta.
- For each rank we can get a popup-window with a list of all labels.
- Image:
- Answer:
- The image shows a list of all species in the Insecta subtree.
General visualization of trees: Browsing
Explore the tree by performing a series of
up and downs in the tree
- Rating:
- Strength
- Process:
- This is done by zoom-in and zoom-out operations.
- Video:
- Click to see video
- Answer:
- Not Applicable
General visualization of trees: Managing the
analysis
Marking nodes of interest
- Rating:
- Possible
- Process:
- Any subset of the displayed cells can be marked (selected) using
the mouse (click, drag, ctrl-click, shift-click).
- However, the marking is quite volatile: The next mouse click into
the table will remove it.
- Another way to mark a set of records is to create a new attribute
interesting and to set its values to yes or
no.
- This rests on the fact that InfoZoom is also a very powerful
editor:
- We can simply select a cell or a range of cells and directly edit
its values like in a spreadsheet.
- The modification is performed in all records represented by the
cell.
- In this way, thousands of records can be modified in a single
operation.
Once the attribute interesting is defined, we can later zoom
on just the interesting values.
- Image:
- Answer:
- Not Applicable
Removing special anomalies
- Rating:
- Strength
- Process:
- InfoZoom is also a very powerful editor.
- We can simply select a cell and directly edit its value like in a
spreadsheet.
- The modification is performed in all records represented by the
cell.
- In this way, thousands of records can be modified in a single
operation.
- Image:
- Animals with different classifications in the two trees
- Answer:
- We experimentally cleared the selected cells in the above image.
- Afterwards about 1000 animals did not have different
classifications anymore.
Saving visualization settings for future
reference
- Rating:
- Strength
- Process:
- The navigation history is stored as a sequence of commands.
- A command sequence can be stored as a named query.
- In order to perform a query later, InfoZoom executes the stored
navigation commands.
- Image:
- Answer:
- Not Applicable
Keeping the history of your analysis,
reviewing it and replaying it with different parameters
- Rating:
- Strength
- Process:
- The navigation history is stored as a sequence of commands.
- Using the back and forward buttons we can get
an animated replay of our interaction.
- The buttons also have an associated menu that shows the command
history.
- It can be used to jump directly to a saved state.
- Image:
- Answer:
- Not Applicable
Phylogenies: Application specific tasks
This data set was not analyzed with InfoZoom.
Classifications: Application specific tasks
To
what extent are the differences in the classifications due to
differences
in how animals are thought to be related? Are there other kinds of
differences and can you explain them?
- Rating:
- Strength
- Process:
- There are two kinds of differences:
- Some animals exist in only one of the trees
- Some animals are differently classified in the two trees
- These differences can be exactly determined:
- Latin Name uniquely identifies an animal within each tree.
- We define a derived attribute Count(Tree) per Latin Name.
- The resulting value is 2 for most of the animals.
- This means that they are contained in both of the trees.
- But some animals are found in only one tree.
- We can zoom on them by clicking on the 1.
The derived attribute Latin Path is the full path name of each
animal.
- It is similar to a fully qualified file name.
We define Count(Latin Path) per Latin Name and zoom on the
animals where the result is 2.
- These have a different classification in the two trees.
We can also use color coding in order to highlight the areas of
the overall trees where there are different classifications.
- To achieve this, we specify that the attribute Count(Latin
Path) per Latin Name defines the coloring.
- Each cell is now colored according to the average value of Count(Latin
Path) per Latin Name of the animals it represents.
- Therefore, red areas have fewer differences than the average,
green areas contain more differences.
- We also defined several attributes like
- Count(Phylum) per Latin Name
- Count(Class) per Latin Name
- Count(Family) per Latin Name
in order to spot the differences more precisely.
- Image:
- Animals contained in only one of the trees
- Animals with a different classification in the two trees
- Color Coding of the frequency of different path names
- Answer:
- In the first image we can see that mainly chordates are found in
only one tree.
- In the second image we can see that a main reason for different
paths is that some subclasses and infraclasses are not used in tree B at all.
- In the third image the green cells, mainly Chordata and
especially Aves, contain many animals with two different
classifications.
- Using Count(Phylum) per Latin Name we detected that
the 17 animals of Genus Apus even belong to two different phyla,
namely Chordata in A, but Arthropoda in B!
- Ensifera ensifera also belongs to two different phyla
(chordates/arthropods) because it is not clear whether it is a bird or
an insect.
- Among others 2967 perching birds belong to different families.
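The two derived attributes Count(Tree) per Latin Name and Count(Latin Path) per Latin Name can be sketched in Python. We assume here that Count counts distinct values per group, and that the two trees are combined into one table of (Tree, Latin Name, Latin Path) rows. The animal names and paths below are placeholders, not taken from the contest data:

```python
from collections import defaultdict

# Placeholder rows: (Tree, Latin Name, Latin Path).
rows = [
    ("A", "Aus bus", "/X/Y/Aus bus"),
    ("B", "Aus bus", "/X/Z/Aus bus"),   # same animal, different path
    ("A", "Cus dus", "/X/Y/Cus dus"),
    ("B", "Cus dus", "/X/Y/Cus dus"),   # same animal, same path
    ("A", "Eus fus", "/X/Y/Eus fus"),   # only in tree A
]


def distinct_count_per_name(rows, column):
    """Number of distinct values of one column (by index) per Latin Name."""
    values = defaultdict(set)
    for row in rows:
        values[row[1]].add(row[column])
    return {name: len(found) for name, found in values.items()}


tree_count = distinct_count_per_name(rows, 0)   # Count(Tree) per Latin Name
path_count = distinct_count_per_name(rows, 2)   # Count(Latin Path) per Latin Name

only_one_tree = sorted(n for n, c in tree_count.items() if c == 1)
reclassified = sorted(n for n, c in path_count.items() if c == 2)
print(only_one_tree, reclassified)
```

Replacing the path column by a single rank such as Phylum gives the analogue of Count(Phylum) per Latin Name.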
Can you say in how many different subtrees
a particular common name (such as "dolphin" or "horse") is used? How
closely are these animals related? Are common names a good guide to
understanding relationships?
- Rating:
- Strength
- Process:
- We perform a full-text search in all attributes for "dolphin".
- This zooms on all animals that contain "dolphin" in at
least one of their attributes.
- Image:
- Find-Dialog
- Result of full-text search for "dolphin"
- Result of full-text search for "horse"
- Answer:
- The search for "dolphin" returns 54 animals:
Many marine dolphins and river dolphins, but also a
bird called Myzomela adolphinae and a clam called Nucula
delphinodonta.
- The common family names marine dolphins and river
dolphins correspond to the Latin family names Delphinidae, Iniidae,
and Platanistidae.
In this case the common names reveal a relationship not reflected in
the Latin names.
In general, however, common names are not very useful because for most
animals there is no common name at all.
- The search for "horse" results in more than 1000 animals (in A
and B), most of them seahorses and horse flies.
The family horses contains only 14 animals, namely the mammals.
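A full-text search over all attributes is a substring match, which explains hits such as Myzomela adolphinae. The sketch below assumes a case-insensitive match, which is our assumption about the search dialog. The records are simplified to three attributes, and the second and third records are invented:

```python
def full_text_search(records, needle):
    """Records in which at least one attribute value contains the needle
    (case-insensitive substring match, an assumption of this sketch)."""
    needle = needle.lower()
    return [
        record for record in records
        if any(needle in str(value).lower() for value in record.values())
    ]


records = [
    {"Latin Name": "Myzomela adolphinae", "Common Name": "", "Family": ""},
    {"Latin Name": "", "Common Name": "", "Family": "river dolphins"},
    {"Latin Name": "", "Common Name": "some beetle", "Family": ""},
]

hits = full_text_search(records, "dolphin")
print(len(hits), "of", len(records), "records match")
```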
How many species or subspecies are named
after biologists named "Townsend"?
- Rating:
- Strength
- Process:
- We perform a full-text search in Latin Name and Common
Name for "townsend".
- Image:
- Answer:
- In Tree A there are 48 Latin Names and 15 Common Names which
contain "townsend".
What kind of feedback does your tool
provide to alert the user quickly when a wrong name is entered?
- Rating:
- Strength
- Process:
- We perform a full-text search in all attributes for "Spirurida" and
then "Spirulida".
- Image:
- Result of full-text search for "Spirurida"
- Result of full-text search for "Spirulida"
- Answer:
- The first image clearly shows that the expected result was not
obtained.
For the top five subtrees with the most
nodes-- are they likely to have a parent of a particular rank? Or does
this happen in many ranks? Can you comment on how useful "rank" is?
- Rating:
- Strength
- Process:
- We do not completely understand the question.
- We try to answer it anyway.
The size of subtrees is proportional to the width of the cells.
- Looking at the complete tree A we see several large cells
at
different levels.
- Image:
- Answer:
- The family Formicidae is quite large, even larger than
many of the classes and phyla.
- The 5 largest subtrees are the Phylum Arthropoda,
the Subphylum Hexapoda, the Class Insecta, the
Subclass Pterygota, and the Superorder Neoptera.
File system and usage logs: Application
specific tasks
Introduction
As with the animal
tree, we had to transform the XML files to an object/attribute table.
Each leaf of the tree, i.e. each file, constitutes a column of the
table.
Each row of the table corresponds to a file attribute.
The inner nodes are
the
directories. They are also represented as attributes of each file:
- File
path is the complete path name of a
file, e.g. /class/fall2002/cmsc414/index.html.
- Name is the part after the
last slash, e.g. index.html.
- Directory
path is the part before the last slash,
e.g. /class/fall2002/cmsc414.
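The three path attributes can be derived from the full path with the standard library. A small Python sketch; the function name split_file_path is ours, and the example path is the one given above:

```python
import posixpath


def split_file_path(file_path):
    """Derive 'directory path' and 'name' from a full, slash-separated path."""
    directory_path, name = posixpath.split(file_path)
    return {
        "file path": file_path,
        "directory path": directory_path,
        "name": name,
    }


attributes = split_file_path("/class/fall2002/cmsc414/index.html")
print(attributes["directory path"])
print(attributes["name"])
```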
Five complete
snapshots
of the file system have been taken at the end of weeks A to E.
They were all combined into one large table.
In order to get a
first
overall impression of the whole data set, we start in InfoZoom’s Overview
Mode.
Unlike in the Compressed Table Mode, the data set is not
visualized as a table here.
Instead each row independently shows the value distribution of an
attribute.
The size of each cell is proportional to the number of files with that
value.
The following observations can be made:
- There are 343,614 file/week pairs.
- The snapshots from the 5 weeks have roughly the
same number of files.
- The most frequent name is index.html (see selected
cell and edit line above the table).
- html/htm and gif are the most frequent
file
types.
- users, class, usershollings, and projects
are the largest toplevel directories.
- On lower levels lectures is the
most frequent directory name.
- More than 50% of the files were never hit, about
90% fewer than 6 times.
- Userid 834 is the owner of the
most
files.
- Size and userid are missing for about 20% of the
files.
Browsing
is performed by interactively zooming into sub areas.
This is
demonstrated in a video.
Click to see
video
For example, we can
double-click the cell containing the value A, to zoom on the first snapshot only.
The screen shot shows the result after a second zoom on projects
and a switch to the Compressed Table Mode:
We
can observe the following:
- The projects hcil and hpsl contain the most files.
- The
directory jazz-chat contains
mainly html-files.
- hpsl contains a large directory
called ppt.
Where are
the big
directories?
There are several different interpretations of this question:
- A: Which toplevel directories contain the highest number of files?
- B: Which toplevel directories occupy the most disk space?
- C: Which directories directly contain the highest number of files?
- D: In which directories do the directly contained files occupy the
most disk space?
Interpretation A: Which toplevel directories contain the highest
number of files?
Alternative 1:
- Rating:
- Strength
- Process:
- Big directories are immediately visible since the width of each
cell is proportional to the number of files it represents.
- Image:
- Answer:
- It's obvious that users and class are the
biggest toplevel directories.
- The size of lower level directories can be observed in the
same way
Alternative 2:
- Rating:
- Strength
- Process:
- The value list for toplevel directory can be sorted by
frequency. The frequency is identical to the number of files.
- Image:
- Answer:
- users contains 18723 files.
- class contains 16978 files.
Interpretation B: Which toplevel directories occupy the most
disk space?
Alternative 1:
- Rating:
- Strength
- Process:
- Define a derived attribute Sum(size in KB) per toplevel
directory and exclude the unknown sizes.
- Image:
- Answer:
- class occupies 1.4 GB
- users occupies 1.1 GB
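Sum(size in KB) per toplevel directory, with unknown sizes excluded, is a grouped sum that skips missing values. A Python sketch, assuming unknown sizes are represented as None; the sizes are invented and sum_size_per_toplevel is a hypothetical helper:

```python
from collections import defaultdict

# Invented rows: (toplevel directory, size in KB, or None if unknown).
files = [
    ("class", 100.0), ("class", 50.0),
    ("users", 70.0), ("users", None),
]


def sum_size_per_toplevel(files):
    """Total known size per toplevel directory; unknown sizes are skipped."""
    totals = defaultdict(float)
    for toplevel, size in files:
        if size is None:
            continue
        totals[toplevel] += size
    return dict(totals)


totals = sum_size_per_toplevel(files)
for toplevel, size in sorted(totals.items(), key=lambda item: -item[1]):
    print(toplevel, size)
```

Grouping by directory path instead of toplevel directory gives the analogue of Sum(size in KB) per directory path.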
Alternative 2:
- Rating:
- Strength
- Process:
- Declare size in KB as the attribute that determines the
column width of the table.
- Normally, all columns have the same width, and therefore the
width
of each cell is proportional to the number of files it represents.
- In the image below, however, the width of each cell is
proportional to the total size of the files it represents.
- Image:
Answer:
- We can easily see that projects is the biggest toplevel
directory, mainly because of the papers in pdf-format.
- In /movies/monte there are some very large individual
files (see large cells for the attribute file path).
Interpretation C: Which directories directly contain the highest
number of files?
- Rating:
- Strength
- Process:
- Define a derived attribute Count(file path) per directory path
and zoom on its highest values.
- Image:
- Answer:
- / projects / hcil / jazz / list-archives / jazz-chat
contains 950 files
- / users / building / pitr / pics / tn contains
510 files
Interpretation D: In which directories do the directly contained
files occupy the most disk space?
- Rating:
- Strength
- Process:
- Define a derived attribute Sum(size in KB) per directory path
and zoom on its highest values.
- Image:
- Video: Click to see
Video
Answer:
- We can see that /movies/monte and /projects/SoftEng/ESEG/papers
are the two biggest directories.
Can you see different patterns in the
files? (Can you make out the difference between personal pages, class
pages and research project pages?)
- Rating:
- Strength
- Process:
- Zoom on each of the four largest toplevel directories.
- Image:
- Toplevel directories
- Toplevel directory users
- Toplevel directory projects
- Toplevel directory class
- Answer:
- users, projects, class and usershollings are
the biggest toplevel directories.
- /users/hollings seems to be very similar to usershollings.
Most other user files can also be found in two places.
- For the files in the /users<name> directories
hitCount is 0 and there is no information about file size
and creation/modification time.
- hcil and hpsl are the projects with the most
files, mainly because of the large directories /projects/hcil/jazz/list-archives/jazz-chat
and /projects/hpsl/classes/818s-s98/ppt/titan.
Were there a lot
of pages created recently? If so, in which part of the file system?
- Rating:
- Strength
- Process:
- We used the creation and modification
times of the files to answer this question.
- First of all we defined the derived
attribute mtime >= ctime. Somewhat surprisingly, the
result was false in most cases.
- On the other hand, ctime
>= mtime is true in 99.9% of the cases.
(There are 59 exceptions!?)
- So obviously the two attributes ctime
and mtime had been swapped by
mistake in the data set.
We corrected this error and defined a
few attributes derived from ctime and mtime:
-
- It can be seen that the vast majority
of the files were created and modified before the first snapshot.
- Most files were even modified on the
same day in August 2002!
- In order to answer the question we
zoomed on all files with a creation day 2003/1/25 or
later in snapshot E.
- Image:
- Answer:
- 1,435 files have been created after 2003/1/24.
- In the class directory many new files were created,
mainly in the subdirectory spring2003.
- In Library many new bib files were created.
- In projects/hcil there are also a lot of new files.
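The plausibility check on ctime and mtime, the correction, and the final zoom can be sketched in Python. The three records and their dates are invented; the helper names fraction and swap_times are ours. The sketch assumes, as in the analysis above, that a file cannot be modified before it is created:

```python
from datetime import date

# Invented records with ctime and mtime as delivered (possibly swapped).
records = [
    {"path": "/a", "ctime": date(2003, 2, 1), "mtime": date(2003, 1, 30)},
    {"path": "/b", "ctime": date(2002, 8, 20), "mtime": date(2002, 8, 20)},
    {"path": "/c", "ctime": date(2002, 9, 1), "mtime": date(2002, 8, 15)},
]


def fraction(records, predicate):
    """Share of records for which the predicate holds."""
    return sum(1 for r in records if predicate(r)) / len(records)


mtime_ge_ctime = fraction(records, lambda r: r["mtime"] >= r["ctime"])
ctime_ge_mtime = fraction(records, lambda r: r["ctime"] >= r["mtime"])


def swap_times(record):
    """Return a copy of the record with ctime and mtime exchanged."""
    return {**record, "ctime": record["mtime"], "mtime": record["ctime"]}


# If mtime >= ctime rarely holds while ctime >= mtime almost always holds,
# the two attributes were most likely exchanged.
corrected = [swap_times(r) for r in records]

# Zoom on files created on 2003/1/25 or later.
recent = [r["path"] for r in corrected if r["ctime"] >= date(2003, 1, 25)]
print(mtime_ge_ctime, ctime_ge_mtime, recent)
```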
Are the newer directories bigger than the
older projects?
- We compared the number of files in the subdirectories of projects in weeks A and E.
- Rating:
- Strength
- Process:
- We defined a new attribute Count(file path) per level2 and
week.
- Next we zoomed on the weeks A and E and then on projects.
- Image:
- Answer:
- The image shows the three biggest
directories. None of them grew between A and E.
When was the page giving directions to the
department last updated?
- Rating:
- Strength
- Process:
- A full-text search for "directions" shows a few files containing
that string in their names.
Clicking on the toplevel directory department
shows the result.
- Image:
-
- Answer:
- The file /department/directions.shtml was
last modified on 2002/8/31, 03:43:19
Which are the popular webpages?
Alternative 1:
- Rating:
- Strength
- Process:
- In order to find the most popular
files we simply zoom on the highest values of the attribute hitCount.
- Image:
- Click to see video
- Answer:
- It turns out that /index.html and /index.shtml have the most hits, which
is not very surprising.
Alternative 2:
- Rating:
- Strength
- Process:
- A more interesting question is to find the most popular pdf-files.
- To accomplish this, we simply double-click pdf before
zooming on the highest values of hitCount.
- Image:
- Click to see video
- Answer:
- / users / chiraz / thesis.pdf had 127 hits in
week A
- / class / spring2002 / cmsc818m / doc / 0220 /
expanding.pdf had 210 hits
- / Library / TRs / CS-TR-4405 / CS-TR-4405.pdf
had 251 hits.
Alternative 3:
- Rating:
- Strength
- Process:
- We can also use an attribute-dependent column width here.
In the image below, the width of each cell is proportional to the
hitCount.
So large cells represent popular web pages.
Moreover, we have defined a derived attribute Sum(hitCount) per toplevel
directory
and we have sorted the table by this new attribute.
- Image:
-
- Answer:
- We can observe that projects
is the most popular top level directory and hcil is the most popular project, but mainly because
of the banner-images in gif-format.
Are there some labs more popular than
others?
- Rating:
- Strength
- Process:
- Zoom on projects and define Sum(hitCount) per level 2.
- Image:
- Answer:
- hcil, plus, and hpsl are the most popular
labs.
Which areas are getting more popular? Less
popular?
- Rating:
- Strength
- Process:
- We define the attribute hitCount as the color-defining
attribute (indicated by the traffic lights left of its attribute name).
Then we zoom into weeks A and E.
This shows cells with more hits than the average in green and cells
with fewer hits than the average in red.
Comparing weeks A and E, a change of color
indicates that the number of hits increased or decreased.
- Image:
- Answer:
- The toplevel directory class gets about three times
as many hits on average in week E as in
week A (from 2.11 to 6.46).
The average for the directory users decreased from 3.06 to 2.87.
Are new pages more popular than old pages?
- Rating:
- Strength
- Process:
- We defined a new attribute creation
year derived from creation
day,
and another derived attribute Average(hitCount) per creation
year:
- Image:
- Answer:
- It turned out that the average hit
count is generally lower for old files. The year 2000 is an exception.
Another exception is the group of 3 files created in 1993. These have an
average hit count of about 60, mainly because the file
/users/samir/khuller.gif was hit 129 times.
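The chain of two derived attributes, creation year from creation day and then Average(hitCount) per creation year, can be sketched in Python. The dates and hit counts are invented, and average_hits_per_year is a hypothetical helper:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Invented rows: (creation day, hitCount).
files = [
    (date(1999, 5, 1), 10), (date(1999, 6, 1), 20),
    (date(2002, 8, 20), 0), (date(2002, 8, 21), 4),
]


def average_hits_per_year(files):
    """Derive the creation year from the creation day,
    then average hitCount per year."""
    hits = defaultdict(list)
    for creation_day, hit_count in files:
        hits[creation_day.year].append(hit_count)
    return {year: mean(values) for year, values in sorted(hits.items())}


per_year = average_hits_per_year(files)
print(per_year)
```

As the 1993 case in the answer shows, an average over very few files can be dominated by a single heavily used file.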
Which old pages are popular?
- Rating:
- Strength
- Process:
- We zoomed on the files created in
1999
or before and on the highest hit counts.
This showed some banner images in GIF-format.
- It is more interesting to look for
popular papers. Therefore we focused on ps- and pdf-files.
- Moreover, we defined the derived
attribute Sum(hitCount) per file path to sum up the hit
counts of the 5 weeks for each file.
- Image:
- Answer:
- See image.
What proportion of the pages are never
used?
- Rating:
- Strength
- Process:
- The distribution of hit counts can be directly seen.
- Image:
- Answer:
- About 50% of the pages are never used.
What proportion of the pages are seldom
used?
- Rating:
- Strength
- Process:
- The distribution of hit counts can be directly seen.
- Image:
- Answer:
- About 90% of the pages are used 5 times or less per week.